For teams in the National Basketball Association (NBA), a key strategy for winning more games is to allocate their salary cap properly. The salary cap is an agreement that limits the amount of money a team can spend on players’ salaries. Evaluating the performance of each NBA player and assigning a suitable salary is a complicated problem. In this project, we aim to predict the salaries of NBA players in the 2021-2022 season based on their game statistics. We collected game statistics commonly used to evaluate players from the NBA official website, built linear, generalized linear, tree-based, and black-box models, and compared the models to select a final predictive model.
The data used for exploratory analysis and model construction are based on two datasets: the NBA players’ contracted salary dataset [1] and game performance statistics dataset [2] in 2021-2022. The data were preprocessed using the following pipeline:
Count statistics (field_goal, free_throw, etc.) were divided by the variable minute to convert them to per-minute efficiency measures.
The final cleaned dataset contained 442 records and 24 variables: 2 categorical variables, 21 numeric predictors, and 1 numeric response variable, salary.
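The per-minute conversion described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the function name `to_per_minute`, the example record, and the choice to return 0.0 for players with no minutes are all assumptions.

```python
def to_per_minute(record, count_fields, minute_field="minute"):
    """Return a copy of `record` with each count field divided by minutes played."""
    minutes = record[minute_field]
    converted = dict(record)
    for field in count_fields:
        # Assumption: a player with zero minutes gets an efficiency of 0.0.
        converted[field] = record[field] / minutes if minutes > 0 else 0.0
    return converted


# Hypothetical player record (values are illustrative, not from the dataset).
player = {"minute": 30.0, "field_goal": 6.0, "free_throw": 3.0, "age": 24}
per_minute = to_per_minute(player, ["field_goal", "free_throw"])
```

Variables that are not counts, such as age, are left unchanged.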
| Variable Name | Meaning | Variable Type |
|---|---|---|
| position | Position of the player | categorical (5 classes) |
| age | Player’s age on February 1 of the season | numeric |
| team | Team that the player belongs to | categorical (30 classes) |
| game | Number of games played | numeric |
| game_starting | Number of games played as a starter | numeric |
| minute | Minutes played per game | numeric |
| field_goal | Field goals per minute | numeric |
| fg_attempt | Field goal attempts per minute | numeric |
| x3p | 3-point field goals per minute | numeric |
| x3p_attempt | 3-point field goal attempts per minute | numeric |
| x2p | 2-point field goals per minute | numeric |
| x2p_attempt | 2-point field goal attempts per minute | numeric |
| free_throw | Free throws per minute | numeric |
| ft_attempt | Free throw attempts per minute | numeric |
| offensive_rb | Offensive rebounds per minute | numeric |
| defensive_rb | Defensive rebounds per minute | numeric |
| total_rb | Total rebounds per minute | numeric |
| assistance | Assists per minute | numeric |
| steal | Steals per minute | numeric |
| block | Blocks per minute | numeric |
| turnover | Turnovers per minute | numeric |
| personal_foul | Personal fouls per minute | numeric |
| point | Points per minute | numeric |
| salary | Salary of the player in millions (response) | numeric |
The distribution of each variable in the dataset was plotted and examined (see Appendix A). The categorical variables team and position were distributed quite evenly. Because the 30 unique values of team could lead to too many dummy variables in the model, we considered excluding the variable or clustering it into fewer classes for model fitting. Some numeric variables (game_starting, free_throw, offensive_rb, block), as well as the response salary, are skewed, with some players having extremely high salaries.
From the correlation heat map (Figure 1A), some variables were found to be strongly correlated with others, leading to a potential multicollinearity problem. This problem could be handled by using penalized models (such as ridge/lasso regression and elastic net), ensemble models (such as random forest and boosting), or neural networks. From the feature maps (Figure 1B), some predictors (such as age, game, game_starting, free_throw, personal_foul, and point) were found to have non-linear relationships with salary. Generalized linear regression models and other non-linear models, including GAM, MARS, random forest, and neural networks, could be used to capture such non-linearity.
As for the categorical variables position and team, extremely high values and large variance in salary were observed in all positions and in some teams (see Appendix A).
Correlation Heatmap and Feature Maps
To condense the 30-class categorical variable team into fewer dummy variables, we clustered the teams into fewer classes based on the median and standard deviation of players’ salaries in each team. The number of clusters \(k = 3\) was chosen based on the average silhouette width. The resulting clusters were stored in a new variable team_cluster and are presented below:
Clustering on variable team
In the tree-based models, replacing team with team_cluster led to higher prediction accuracy. For the other models, the variable team was still used for model construction.
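The average silhouette width used to choose the number of clusters can be sketched as follows. This is an illustrative implementation with Euclidean distance, not the original analysis code: the function name, the toy (median, standard deviation) points, and the convention of assigning a width of 0 to singleton clusters are assumptions. It expects at least two clusters.

```python
import math
from statistics import mean


def average_silhouette_width(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = mean([math.dist(points[i], points[j]) for j in own])
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean([math.dist(points[i], points[j]) for j in members])
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Hypothetical team-level features: (median salary, salary standard deviation).
team_features = [(2.0, 3.0), (2.5, 3.5), (8.0, 9.0), (8.5, 9.5)]
well_separated = average_silhouette_width(team_features, [0, 0, 1, 1])
poorly_separated = average_silhouette_width(team_features, [0, 1, 0, 1])
```

A clustering is computed for each candidate \(k\), and the \(k\) with the largest average width is kept. Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own.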
After obtaining an overview of the data, we split the dataset into training (80%) and test (20%) sets, used repeated 10-fold cross-validation on the training data to compare the models, and evaluated model performance based on test error. We built 8 types of models in four categories:
Linear Regression Models: Standard Linear Regression Model, Elastic Net Model, Principal Component Regression Model (PCR)
Generalized Linear Regression: Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines Model (MARS)
Tree-based Models: Random Forest (RF), Generalized Boosted Regression Modeling (GBM)
Blackbox Model: Neural Network
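The resampling scheme described before the model list (an 80/20 split followed by repeated 10-fold cross-validation on the training set) can be sketched as follows. The function names, the seeds, the number of repeats (5), and the use of a simple random split rather than a stratified one are assumptions, since the report does not state them.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=1):
    """Shuffle row indices and split them into training and test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def repeated_kfold(indices, k=10, repeats=5, seed=1):
    """Yield (training, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Deal the shuffled indices into k folds of nearly equal size.
        folds = [shuffled[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield training, validation


# 442 records, as in the cleaned dataset.
train_idx, test_idx = train_test_split_indices(442)
splits = list(repeated_kfold(train_idx, k=10, repeats=5))
```

Each candidate model is fitted on every training part and scored on the matching validation part; the test indices are held back until the final comparison.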
There is no tuning parameter for the standard least-squares model.
The elastic-net model has two tuning parameters: alpha (the compromise between lasso and ridge) and lambda (the penalty strength, which limits the number or magnitude of predictor coefficients). The elastic-net model reached its best tune at \(\alpha = 1\) and \(\lambda = 0.44\), which corresponds to a pure lasso penalty (see Appendix B, Figure 1).
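For reference, the roles of the two parameters can be written out explicitly. The form below assumes the common glmnet-style parameterization of the elastic-net objective; the report does not state which implementation was used, so the scaling constants are an assumption.

\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \lambda \left[ \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
\]

Under this form, \(\alpha = 0\) gives ridge regression and \(\alpha = 1\) removes the squared term, leaving only the lasso penalty \(\lambda \lVert \beta \rVert_1\).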
The tuning parameter of PCR is the number of principal components included in the final model. The model with the minimum RMSE included 12 components (see Appendix B, Figure 2).
There is no tuning parameter for GAM. The GAM can capture non-linear trends in the data, but it may have high variance. The predictors age, game_starting, assistance, personal_foul, and point are statistically significant at the 0.0001 significance level.
The tuning parameters for MARS are nprune and degree. When fitting the MARS model, we noticed that the RMSE increased drastically when degree was over 3 and nprune was over 8. Therefore, we chose 1:4 as the range of degree and 2:8 as the range of nprune. The MARS model reached its best tune, with the lowest RMSE, when the number of terms was 6 and the product degree was 3. The MARS model selected 6 of 69 terms and 6 of 54 predictors. The top 3 important predictors are age, minute, and game. The MARS model is highly adaptive compared with the previous models and has a higher prediction accuracy (see Appendix B, Figure 3).
The categorical variable team has 30 classes, which would result in too many dummy variables in our models. Therefore, we clustered team into fewer classes according to similar trends in the median and standard deviation of players’ salaries in each team. We replaced team with the newly generated variable team_cluster, which takes the values 1, 2, and 3, one for each cluster.
The tuning parameters for random forest regression in the ranger package are mtry, the number of variables considered for splitting at each node, and min.node.size, the minimal size of each node. Through 10-fold repeated CV, the optimal random forest model has mtry = 26 and min.node.size = 1. Random forests preserve the advantages of single decision trees, which can handle correlation between variables and non-linearity. However, since mtry = 26 equals our total number of variables, every split considers all variables, so this random forest may not decorrelate the individual trees well and thus may overfit the dataset.
The tuning parameters for generalized boosted regression modeling (GBM) are n.trees, the total number of trees to fit; interaction.depth, the maximum depth of each tree; shrinkage, the learning rate; and n.minobsinnode, the minimum number of observations in the terminal nodes of the trees. Through 10-fold repeated CV, the optimal GBM model has n.trees = 6000, interaction.depth = 5, shrinkage = 0.0008, and n.minobsinnode = 1.
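The interplay between n.trees and shrinkage is easier to see from the boosting update itself. The sketch below assumes a squared-error loss, which the report does not state explicitly; the symbols \(F_m\), \(h_m\), \(\nu\), and \(M\) are notation introduced here for illustration.

\[
F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M
\]

Here \(h_m\) is a regression tree (with depth limited by interaction.depth) fitted to the current residuals \(y_i - F_{m-1}(x_i)\), \(\nu\) is the shrinkage, and \(M\) is n.trees. With \(\nu = 0.0008\), each tree contributes only a very small correction, which is why a large number of trees (\(M = 6000\)) is needed.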
Several neural networks with two hidden layers were built to fit the data. Despite trying different numbers of nodes and applying regularization techniques (L2 and dropout), the resulting models still had a noticeable overfitting problem. Because the dataset is very small (\(n=442\)), the neural network does not perform as well as some traditional statistical models. Neural networks are more useful when the dataset is much larger and has more variables. The optimal model after parameter tuning is a neural network with two hidden layers, with \(n_1 = 10\) and \(n_2 = 5\) nodes in the first and second layers, and with dropout regularization. The following figure shows the resulting MSE in the training and validation sets after 250 epochs.
MSE of the resulting 2-hidden layer neural network in training and validation sets
As shown in the figure, as the number of nodes in the first and second hidden layers increases, neural networks can fit the training data very accurately, with much lower MSEs than the other methods. However, the predictions are unsatisfactory when the models are applied to the test data.
Performance of neural networks with different number of hidden layers
The 10-fold CV RMSE (validation-set RMSE for the neural network), labeled Training RMSE in the table, and the test-set RMSE for all candidate models are shown in the following table. The GBM model has the best performance in terms of both CV and test errors.
| | Linear | ElasticNet | PCR | GAM | MARS | RandomForest | GBM | NeuralNetwork |
|---|---|---|---|---|---|---|---|---|
| Training RMSE | 6.79 | 6.45 | 7.16 | 6.84 | 6.06 | 5.42 | 5.41 | 6.40 |
| Test RMSE | 6.66 | 6.04 | 5.39 | 6.84 | 5.16 | 4.83 | 4.75 | 6.64 |
Based on the comparison above, our best model is the generalized boosted regression model (GBM) with the following tuning parameters:
n.trees = 6000
interaction.depth = 5
shrinkage = 0.0008
n.minobsinnode = 1
The 10 most important variables (computed from permuting OOB data) are minute, age, point, free_throw, fg_attempt, game_starting, assistance, ft_attempt, team_cluster, and defensive_rb.
With our fitted GBM model, we can make predictions on new observations. The RMSE on our test data is 4.745948.
Given that GBM is a black-box model, we use the lime package to explain the model’s predictions on new observations by fitting a simpler model to the permuted data with the above 15 most important features. We randomly selected 6 observations from the test data. The players’ names, true salaries (in millions), and predicted salaries from GBM are:
| player | true_salary | predicted_salary |
|---|---|---|
| Cade Cunningham | 10.050120 | 8.590867 |
| Cam Reddish | 4.670160 | 3.627087 |
| Christian Wood | 13.666667 | 14.730414 |
| Corey Kispert | 3.383640 | 3.248053 |
| D’Angelo Russell | 30.013500 | 14.735256 |
| Danuel House Jr. | 2.045094 | 4.792719 |
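As a small worked example of the error metric used throughout the report, the RMSE can be computed for the six sampled players in the table above. This is a minimal sketch; the function and variable names are illustrative, and the result describes only this six-player sample, not the full test set.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Values copied from the table of six sampled test players (in millions).
players = ["Cade Cunningham", "Cam Reddish", "Christian Wood",
           "Corey Kispert", "D'Angelo Russell", "Danuel House Jr."]
true_salary = [10.050120, 4.670160, 13.666667, 3.383640, 30.013500, 2.045094]
predicted_salary = [8.590867, 3.627087, 14.730414, 3.248053, 14.735256, 4.792719]

sample_rmse = rmse(true_salary, predicted_salary)

# Identify the player with the largest absolute prediction error.
abs_errors = [abs(a - p) for a, p in zip(true_salary, predicted_salary)]
worst_player = players[abs_errors.index(max(abs_errors))]
```

Because RMSE squares each error, the sample value is dominated by the one player whose predicted salary is roughly half of the true salary, while the other five predictions are within about 3 million of the truth.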
The explanations of the GBM model from lime are shown in the following figure. In the plot, the x-axis shows the relative strength of each variable. Positive values (blue) indicate that the variable increases the predicted value, while negative values (red) indicate that it decreases the predicted value.
Take the first case, player Cade Cunningham, as an example. Cade’s true salary is 10.050120 million, and his predicted salary from GBM is 8.590867 million, which is close to the true value. Among the 10 most important variables, the factors minute > 90, point > 0.5, game_starting > 43.8, assistance > 0.1241, fg_attempt > 0.401, and x2p_attempt > 0.269 increase Cade’s predicted salary, while the factors age <= 23, turnover > 0.065, 0.057 < free_throw <= 0.082, and team_cluster = 2 decrease it.
[1] https://www.basketball-reference.com/contracts/players.html
[2] https://www.basketball-reference.com/leagues/NBA_2022_per_game.html